Abstract

Visualization software is widely used in scientific and engineering research. But computed visual- 
izations can be very misleading, and the errors are easy to miss. We feel that the software produc- 
ing the visualizations must be thoroughly evaluated and the evaluation process as well as the 
results must be made available. Testing and evaluation of visualization software is not a trivial 
problem. Several methods used in testing other software are helpful, but these methods are (ap- 
parently) often not used. When they are used, the description and results are generally not avail- 
able to the end user. 

Additional evaluation methods specific to visualization must also be developed. We present sever- 
al useful approaches to evaluation, ranging from numerical analysis of mathematical portions of 
algorithms to measurement of human performance while using visualization systems. Along with 
this brief survey, we present arguments for the importance of evaluations and discussions of ap- 
propriate use of some methods. 


1. This work is supported through NASA contract NAS2-12961. 
Introduction

Visualization software is becoming widely used, as shown by the large number of visualization
environments available today. To benefit from such a system, one must be able to distinguish
between aspects of the visualization that reflect the data and aspects due to the
visualization process itself.

Is the visualization valid? That is, does it show what the user intended? Is the visualization soft- 
ware verifiable? That is, can one demonstrate that the software does what the programmer intend- 
ed? And how can one evaluate a system’s capabilities and compare it with other systems? For the 
most part, the information necessary to answer these questions is not generated or, if it exists, is 
not public. Application scientists have sometimes addressed these questions [e.g. BUNI88], but 
mainly for their own software and application. Papers on the error characteristics of visualization 
techniques [e.g. DARM95, MARC94] are beginning to appear. Experiments using human sub- 
jects to get (relatively) hard data on visualization system performance are required for clinical use 
in medical fields [VANN94] and have begun for various other applications [e.g. BRYS95]. But 
much more work is needed. 

This paper is a brief look at some of what has been done, and some of what needs to be done to 
ensure that scientific visualization software is valid and verifiable. We examine standardized test 
suites, the error characteristics of the visualization process, and human experimentation to test the 
basic hypothesis of visualization: that visualization improves insight and task performance. 

Standardized test suites 

One approach to verification and evaluation is to run visualization software on standard test suites 
that, by general agreement, cover the important characteristics of certain classes of data. Different 
test suites are necessary for different problem domains; computational fluid dynamics, geophysi- 
cal simulations and radiological data, for example, present different challenges to visualization 
systems and therefore need different test suites. One can (and should!) use real scientific data sets 
to evaluate accuracy and performance, but these data sets are not sufficient. They were designed 
to investigate phenomena, not evaluate software. Thus, correct software behavior is difficult to 
verify and the various aspects of performance are difficult to assess. Verification suites and
benchmarks specifically designed to test visualization systems would be of great value.

Correctness (the software does what it claims) and accuracy (it maintains appropriate precision) 
are vital for acceptance by critical users and become more important when visualization software 
provides quantitative data (e.g. volume of a tumor or clearance between moving parts). Other than 
blind faith, there is no reason to believe the results from visualization systems are more than ap- 
proximately accurate most of the time. For example, one well known (but here nameless) visual- 
ization package uses Euler integration for streamlines, a technique well known to produce 
incorrect results in vortices! Worse, there is no accepted way to quantify error, which is always 
present to some degree. There are differences in the results using different packages, but no 
straightforward means of measuring how much the results vary, which is nearer the correct value, 
or if either is even close.

Furthermore, developers testing systems have no well-established methods of determining the
correctness of systems. Each development team creates its own ad hoc testing techniques.
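
The concern about Euler integration in vortices can be made concrete with a small experiment. The
following is a minimal sketch, not taken from any particular package: it assumes an idealized
rigid-body vortex (a hypothetical test field whose exact streamlines are circles) and compares Euler
steps against classical fourth-order Runge-Kutta steps. The function names, step size and step count
are illustrative assumptions.

```python
import math

def vortex(x, y):
    """Velocity of an idealized rigid-body vortex; exact streamlines are circles."""
    return -y, x

def euler_step(x, y, h):
    """One forward Euler step of size h."""
    u, v = vortex(x, y)
    return x + h * u, y + h * v

def rk4_step(x, y, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = vortex(x, y)
    k2 = vortex(x + 0.5 * h * k1[0], y + 0.5 * h * k1[1])
    k3 = vortex(x + 0.5 * h * k2[0], y + 0.5 * h * k2[1])
    k4 = vortex(x + h * k3[0], y + h * k3[1])
    dx = (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
    dy = (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return x + h * dx, y + h * dy

def radius_after(step, h, n_steps, start=(1.0, 0.0)):
    """Distance from the vortex center after n_steps; the exact answer stays at 1."""
    x, y = start
    for _ in range(n_steps):
        x, y = step(x, y, h)
    return math.hypot(x, y)

if __name__ == "__main__":
    h = 0.05
    n = round(2 * math.pi / h)  # roughly one revolution
    print("Euler radius:", radius_after(euler_step, h, n))
    print("RK4 radius:  ", radius_after(rk4_step, h, n))
```

In this field each Euler step multiplies the radius by sqrt(1 + h*h), so the computed streamline
spirals outward instead of closing on itself, while the Runge-Kutta trace stays very close to the
circle for the same step size.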

One may compare the performance of different packages on data sets from various application 
fields, but the results only apply to data with nearly identical characteristics. A well designed set 
of benchmarks will support independent characterization of various aspects of visualization pack- 
age performance (e.g., very fast streamlines but slow isosurface generation). High performance 
computing has many years of experience designing and examining the results of benchmarks [e.g. 
BAIL94]. One could do worse than follow their lead. 

Consider computational fluid dynamics (CFD). Correctness is particularly difficult in CFD visual- 
ization since computational grids often contain complex topologies including multiple, interpene- 
trating 3D blocks with some embedded points marked as not valid. Visualization software 
becomes complex to avoid computations using the masked data points and to perform properly in 
the presence of singularities, where an edge of a grid cell may have length of zero. Other applica- 
tions have different obstacles. 

CFD data sets are legendary for their size. The current state of the art in CFD simulations produces
solutions over multiple three dimensional grids defined by several million points. Each grid point 
is represented by three floating point numbers (for location), and five floating point numbers per 
node are needed to represent the solution, resulting in tens or hundreds of megabytes to capture an 
instant of flow. But researchers are simulating unsteady flows, requiring a few thousand time 
steps. Current visualization environments have trouble with these large CFD data sets. As a general
rule, limits on the size of the data set to be visualized are determined by the hardware available to the
user. Developers may be able to give a minimum recommended configuration, but they cannot
know what maximum to expect, and cannot test all possible data set sizes. A set of variable-sized
benchmarks could provide a means for developers to discover and communicate the size limita- 
tions of their products. 

A good set of test data should have at least the following properties (in no particular order): 

1. The data sets should be scalable to allow tailoring for hardware configurations and the time 
available for performing the tests. 

2. Results should be easily, visually recognizable as correct or incorrect. Furthermore, quantitative 
results should be easily interpretable. 

3. The variety of grids should encompass normal cases and all known pathologies. 

4. Fields defined over the grids should also encompass normal cases and all known pathologies. 

5. Grids and fields of all the relevant dimensionalities should be represented. 

6. The suite should be easily distributable to most hardware and software platforms. 

7. The size (in bytes) of the distribution should be minimized. In particular, it is unlikely that data 
sets themselves should be distributed, but rather the code to generate them. 

8. It should be possible to run the suite in a reasonable amount of time.

9. The results should give unambiguous performance (as well as correctness) information. 

10. The suite should challenge current systems. 

In addition to test data sets, there is a need for a standard set of tests of particular visualization 
techniques and functions.

We examine a few ideas along these lines for techniques commonly used in CFD visualization:

Integral Curves (particle traces):

* Trace forward in time from a point, then trace backwards in time for the same total time from 
the end of the first trace. The distance between the start of the first trace and the end of the second 
is an error measure. Do this for many points in vector fields with various properties. Be sure that 
some of the traces pass near critical points of various types (saddles, nodes, etc.). Also, force trac- 
es through areas of rapid change. 

* Change the grid system in a variety of ways without changing the vector field defined over the 
gridded space, and compare traces starting from the same point. A constant vector field is useful 
here. Be sure to force traces across grid boundaries and near grid singularities. 

* In multi-zone data sets, start traces near grid boundaries and stop just after passing to the next 
grid zone to measure the time to pass between grid zones. Be sure to include a case with a grid 
singularity near the transition. 
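
The forward-then-backward test in the first item can be sketched as follows. This is an
illustration under stated assumptions, not a reference implementation: the analytic saddle field,
the two tracers and all names are hypothetical stand-ins for the vector field and the particle
tracer of the system under test.

```python
import math

def saddle(x, y):
    """Analytic test field with a saddle-type critical point at the origin."""
    return x, -y

def euler_trace(field, start, h, n_steps):
    """Trace with forward Euler steps; a negative h traces backward in time."""
    x, y = start
    for _ in range(n_steps):
        u, v = field(x, y)
        x, y = x + h * u, y + h * v
    return x, y

def rk4_trace(field, start, h, n_steps):
    """Trace with classical fourth-order Runge-Kutta steps."""
    x, y = start
    for _ in range(n_steps):
        k1 = field(x, y)
        k2 = field(x + 0.5 * h * k1[0], y + 0.5 * h * k1[1])
        k3 = field(x + 0.5 * h * k2[0], y + 0.5 * h * k2[1])
        k4 = field(x + h * k3[0], y + h * k3[1])
        x, y = (x + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
                y + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)
    return x, y

def round_trip_error(tracer, field, start, h, n_steps):
    """Trace forward, then backward for the same total time; return the miss distance."""
    end = tracer(field, start, h, n_steps)
    back = tracer(field, end, -h, n_steps)
    return math.hypot(back[0] - start[0], back[1] - start[1])

if __name__ == "__main__":
    start = (0.1, 1.0)  # this trace passes near the saddle at the origin
    print("Euler:", round_trip_error(euler_trace, saddle, start, 0.01, 200))
    print("RK4:  ", round_trip_error(rk4_trace, saddle, start, 0.01, 200))
```

A full test would repeat this for many start points and for fields with other critical point
types. Note that a small round-trip distance is necessary but not sufficient evidence of an
accurate trace.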

Integral Surfaces [HULT90]: 

* Force surfaces around saddles. This is difficult because the surfaces tend to tear. 

* Integrate surfaces into vortices. This is difficult due to twisting of the surface. 

Isosurfaces: 

* Generate isosurfaces on a field where all isosurfaces are sets of spheres. Change the grid such 
that isosurfaces must be generated for all marching cube [LORE87] cases. 
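
One way to get a quantitative result from the sphere test is to measure how far the generated
isosurface vertices lie from the exact sphere. The sketch below assumes a regular grid over
[-1, 1]^3 and computes only the edge crossings by linear interpolation, which is where marching
cubes style algorithms place their vertices; it does not build triangles and does not by itself
exercise every marching cubes case. All names are hypothetical.

```python
import math

def sphere_field(x, y, z):
    """Distance from the origin; every isosurface is an exact sphere."""
    return math.sqrt(x * x + y * y + z * z)

def edge_crossings(field, n, iso, lo=-1.0, hi=1.0):
    """Yield points where linear interpolation along grid edges hits the isovalue."""
    h = (hi - lo) / (n - 1)
    coord = [lo + i * h for i in range(n)]
    values = {(i, j, k): field(coord[i], coord[j], coord[k])
              for i in range(n) for j in range(n) for k in range(n)}
    for (i, j, k), a in values.items():
        for di, dj, dk in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            neighbor = (i + di, j + dj, k + dk)
            if neighbor not in values:
                continue  # edge would leave the grid
            b = values[neighbor]
            if (a - iso) * (b - iso) < 0:  # strict sign change along this edge
                t = (iso - a) / (b - a)
                yield (coord[i] + t * di * h,
                       coord[j] + t * dj * h,
                       coord[k] + t * dk * h)

def max_radial_error(n, iso):
    """Largest distance between a computed vertex and the exact sphere of radius iso."""
    return max(abs(sphere_field(*p) - iso)
               for p in edge_crossings(sphere_field, n, iso))

if __name__ == "__main__":
    for n in (9, 17, 33):
        print(n, max_radial_error(n, 0.6))
```

The error should shrink as the grid is refined; a package whose vertices do not converge toward the
sphere in this way has a problem worth investigating.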

Cutting Planes: 

* Examine results in overlapping grids to see how visualization of data in the overlap is handled. 

* Include benchmarks for time dependent data with static grids. This will reward the developer 
who reuses rather than recomputes values when possible. For example, when the location of the 
cutting plane remains constant (but the scalar field changes), the vertices and the interpolation fac- 
tors needed don’t change. 
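
The reuse described in the last item can be sketched for the simplest case. The example assumes a
regular grid spanning [0, 1] in x and a cutting plane x = x0 with x0 in that range; the class and
method names are hypothetical. The cell index and interpolation factor are computed once, and each
time step then costs only one weighted sum per plane vertex.

```python
class CuttingPlaneX:
    """Plane x = x0 through a regular grid with nx nodes spanning [0, 1] in x."""

    def __init__(self, nx, x0):
        h = 1.0 / (nx - 1)
        # Index of the grid layer just below the plane, clamped to a valid cell.
        self.i = min(int(x0 / h), nx - 2)
        # Interpolation factor between layer i and layer i + 1; fixed for all time steps.
        self.t = x0 / h - self.i

    def sample(self, field):
        """Interpolate field[i][j][k] onto the plane; returns a 2D list indexed [j][k]."""
        lower, upper = field[self.i], field[self.i + 1]
        t = self.t
        return [[(1.0 - t) * a + t * b for a, b in zip(row_lo, row_hi)]
                for row_lo, row_hi in zip(lower, upper)]

def make_field(nx, ny, nz, offset):
    """Hypothetical time step: a scalar field equal to x + offset on the grid."""
    h = 1.0 / (nx - 1)
    return [[[i * h + offset for _ in range(nz)] for _ in range(ny)]
            for i in range(nx)]

if __name__ == "__main__":
    plane = CuttingPlaneX(nx=5, x0=0.3)  # geometry and weights computed once
    for step in range(3):                # only the scalar field changes
        values = plane.sample(make_field(5, 2, 2, offset=float(step)))
        print(step, values[0][0])
```

A benchmark with a fixed plane and many time steps would separate systems that cache these factors
from those that recompute the intersection every frame.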

Interpolation: 

* Design benchmarks to test interpolation time and precision, including non-linear interpolation.

Vector field topology [HELM91, GLOB91]:

* Place critical points in grid cells with grid singularities. 

* Place critical points on computational boundaries. 

These ideas are offered as initial suggestions, not final solutions. There are certainly other needs 
for other applications, and there are probably additional or even better tests for CFD data as well. 

Error characterization 

Visualization, particularly of numerical results, can be thought of as constructing models of mod- 
els of models. Even when visualizing experimental data the concept is the same, only the first few 
layers of models are different. Each model is an imperfect representation of the system from 
which it is derived, so error is introduced.

Consider the models constructed in the course of studying a physical system numerically with
visualization in the loop.

The first step in numerical study of a physical system, say airflow over a wing, is selecting a
continuous mathematical model, in this case the Navier-Stokes equations. In their usual form, these
equations presume an ideal gas, but air is not exactly an ideal gas. The scientist must understand 
the error introduced to ensure that it does not invalidate the desired results. 

Similarly, to solve continuous equations numerically on a computer often requires a discrete 
mathematical model. A given continuous mathematical model may be discretized in several ways, 
and at a variety of resolutions, each introducing different amounts and kinds of errors. Software 
implementing the discrete model introduces bugs, and execution of the software introduces 
round-off errors. All of these error sources are discussed in numerous textbooks, college courses 
and research articles. Scientists have substantial resources available to help understand the error 
and ensure that the simulation is faithful to the important aspects of the original physical system. 
Indeed, a researcher is expected to understand these issues and normally cannot publish in reputa- 
ble journals if the sources of error are not directly and successfully addressed. 

The story is quite different once the realm of visualization is entered. Even before the data reach a 
visualization system, several transformations may be applied. For example, at one facility, CFD
data are routinely truncated from 64 to 32 bits before visualization. This is usually acceptable, but 
occasionally causes problems for the unwary. Similarly, data are often transformed from cell cen- 
tered to the point centered schemes used by most visualization systems. These transformations 
introduce error, which must be understood to ensure it does not corrupt the analysis. 
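
The effect of reducing 64-bit values to 32 bits can be demonstrated with the Python standard
library. This sketch assumes the reduction is an IEEE 754 conversion to single precision, which
rounds to the nearest representable value rather than literally truncating; the sample values are
hypothetical and were chosen to resemble pressures in pascals.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

def difference_survives(a, b):
    """True if two distinct 64-bit values are still distinct after conversion."""
    return to_float32(a) != to_float32(b)

if __name__ == "__main__":
    p0, p1 = 101325.0, 101325.001  # differ by 0.001
    print("64-bit difference:", p1 - p0)
    print("32-bit difference:", to_float32(p1) - to_float32(p0))
    print("survives:", difference_survives(p0, p1))
```

Near 100000 the spacing between adjacent 32-bit floats is 2**-7, about 0.0078, so the small
difference above vanishes. Any quantity derived from small differences of large values, such as a
gradient, is therefore at risk after the conversion.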

Some visualization techniques assume continuous data fields. These fields are simulated using
interpolation between the points where data are available. Trilinear interpolation of hexahedral
cells is a particular favorite. These interpolation schemes are seldom consistent with the
interpolation used by the simulation software, introducing additional error.

Each visualization algorithm introduces its own set of errors. For example, particle tracers
accumulate error since each point is generated from the previous point. Different particle tracing
techniques will occasionally generate quite different results [KENW93].
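
For reference, trilinear interpolation within a single hexahedral cell can be sketched as below.
The example assumes the cell has been mapped to the unit cube, so that (u, v, w) are local
coordinates in [0, 1]; the function name and the corner indexing c[i][j][k] are illustrative
choices. The usage lines show a field the scheme reproduces exactly and one it does not.

```python
def trilinear(c, u, v, w):
    """Interpolate corner values c[i][j][k] (i, j, k in {0, 1}) at local point (u, v, w)."""
    # Interpolate along the first axis on each of the four cell edges.
    c00 = c[0][0][0] * (1 - u) + c[1][0][0] * u
    c01 = c[0][0][1] * (1 - u) + c[1][0][1] * u
    c10 = c[0][1][0] * (1 - u) + c[1][1][0] * u
    c11 = c[0][1][1] * (1 - u) + c[1][1][1] * u
    # Then along the second axis, and finally along the third.
    c0 = c00 * (1 - v) + c10 * v
    c1 = c01 * (1 - v) + c11 * v
    return c0 * (1 - w) + c1 * w

def corners(f):
    """Sample a function f(x, y, z) at the eight corners of the unit cube."""
    return [[[f(i, j, k) for k in (0, 1)] for j in (0, 1)] for i in (0, 1)]

if __name__ == "__main__":
    linear = corners(lambda x, y, z: 1 + 2 * x + 3 * y + 4 * z)
    print(trilinear(linear, 0.25, 0.5, 0.75))   # linear fields are reproduced exactly
    quadratic = corners(lambda x, y, z: x * x)
    print(trilinear(quadratic, 0.5, 0.5, 0.5))  # true value is 0.25
```

The second case returns 0.5 where the underlying function is 0.25, a simple instance of the error
introduced when the interpolant does not match the scheme used by the simulation.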

Another set of errors comes from rendering. When a shaded polygon is rendered using the usual
graphics shading techniques, interpolation is carried out in screen space. As a result, the color of a 
point on a surface is affected by the current viewing transformation! This may be acceptable for 
qualitative assessment, but the careful user must be aware of such problems. 

Better understanding of the numerical error in the individual processing stages is not good 
enough. How the various errors compound, cancel or otherwise interact is also important but little 
investigated. Changing the order in which operations are performed can cause significant differ- 
ences. Interpolating between grid node colors (as done by fast rendering hardware) often gives a 
different result than interpolating data values and then mapping to colors. To make a careful 
choice of appropriate visualization methods, a scientist needs the error characterization informa- 
tion. Since scientists will be investigating new phenomena, and applying visualization tools in 
unexpected ways, they must have access to the error analysis, not just its conclusions. 
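
The difference between interpolating colors and interpolating data can be shown with two grid
nodes and a color map that is not linear in RGB. The diverging blue-white-red map below is a
hypothetical example, as are the function names; any color map that bends in RGB space gives a
similar discrepancy.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def colormap(s):
    """Hypothetical diverging map: blue at 0, white at 0.5, red at 1 (RGB in [0, 1])."""
    s = min(max(s, 0.0), 1.0)
    if s < 0.5:
        t = s / 0.5
        return (t, t, 1.0)
    t = (s - 0.5) / 0.5
    return (1.0, 1.0 - t, 1.0 - t)

def color_then_interpolate(v0, v1, t):
    """Map node values to colors, then interpolate the colors."""
    c0, c1 = colormap(v0), colormap(v1)
    return tuple(lerp(a, b, t) for a, b in zip(c0, c1))

def interpolate_then_color(v0, v1, t):
    """Interpolate the data value, then map it to a color."""
    return colormap(lerp(v0, v1, t))

if __name__ == "__main__":
    print(color_then_interpolate(0.0, 1.0, 0.5))  # (0.5, 0.0, 0.5), a purple
    print(interpolate_then_color(0.0, 1.0, 0.5))  # (1.0, 1.0, 1.0), white
```

Halfway between the two nodes the data value is 0.5, which this map shows as white, yet color
interpolation displays purple, a color that does not appear anywhere in the map.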

It behooves the visualization developer to understand the sources of error in the visualization pro- 
cess, thoroughly document the errors - in the visualization itself if possible - and take steps to 
ensure that the magnitude of the error is much less than the errors introduced by the modeling pro- 
cess. After all, users want to see their data, not visualization system induced error. 

Finally, perceptual and cognitive effects color an investigator’s view of the pixels displayed on the 
screen. Hopefully, the user’s mind’s eye will help generate insight into the original physical sys- 
tem, spawning new hypotheses and experiments. How can we tell whether what is perceived illu- 
minates the scientist’s questions? 

Experiments with human subjects 

The grand hypothesis of the visualization community is that scientific visualization improves 
human insight. Several methods have been traditionally used to prove this hypothesis: 

Proof by repeated assertion 

Proof by vigorous gesticulation 

Proof by pretty picture 

(For extended comments relevant to Proof by Pretty Picture, see [GLOB94].) There are better 
approaches to proving the insight hypothesis. Most visualization practitioners have accumulated a 
great deal of anecdotal evidence. Such evidence could be, and should be, collected and carefully 
examined. Better yet, if one wants to prove that the state of a human being has changed, why not 
run a controlled experiment and measure the change of state? 

Ideally, one measures the insight of an individual, shows them a visualization, measures the 
insight again and studies the differences. Unfortunately, there is no general way to measure 
insight. However, good teachers write test questions to reveal a student’s insight (or lack thereof), 
so it may be possible in certain specialized circumstances. 

Although insight is difficult to measure, task performance can often be measured with some
accuracy. Thus, experiments to determine the efficacy of visualization systems and techniques for
particular tasks can, and should, be developed.

Although exploratory scientific visualization is difficult to break into specific tasks, engineering
and medical visualization applications are fairly task specific and amenable to an experimental
approach, as are other analytical visualization uses.

It should be noted that experiments comparing visualization systems are easier to undertake and 
interpret than experiments to evaluate or characterize a particular system. We therefore focus dis- 
cussion on experimental comparison of visualization systems. 

Consider comparing two visualization systems for task performance. In theory, each system might 
be characterized as a point or region in a large multi-dimensional space. Although some dimen- 
sions can be given, e.g., techniques available, data formats supported and memory usage, not all 
the relevant dimensions are known or easily measured. Therefore, we have an unknown, probably 
non-linear function f: system -> task-performance. Each domain/range pair of this function can
only be discovered by laborious, controlled experiment. Interpolation is fraught with danger, and 
extrapolation ridiculous. Obviously, characterizing this function — whose domain is not thor- 
oughly understood — is a daunting task. However, at present we have almost no experimental data 
points at all. Even a few widely spaced, but reasonably firm, data points in a single context and
widely available would be of considerable value. 

In particular, one would like to predict performance for other, related tasks and predict the effect 
of changes to the system(s) under study. As mentioned above, predicting performance of tasks 
other than those studied is difficult. Predicting the effect of changes in the system is also difficult, 
in part because different aspects of systems interact to affect end user performance. 

A final word of warning for those considering controlled experiments of visualization systems: be
careful to choose experimental subjects from the target user audience. For example, if insight into
moderately experienced users is desired, don’t use programmers or novice users (or freshman
psychology students) as experimental subjects.

Conclusion 

The quality of a visualization system, method or a particular visualization, is almost an unknown 
concept. System comparisons are based on ease of use, flexibility, speed of operations and the 
presence or absence of particular features. None of these criteria are really relevant unless one can 
rely on the images produced to reflect the data without distortion or confusion. After all, it’s easy 
to make wrong pictures at sixty frames per second. 

Some of the work needed is just the direct application of numerical analysis techniques and dis- 
semination of the results. Other important research will involve fundamental work on identifying 
all the relevant parameters of “visualization quality” and their relationships. Then the scattering of 
points in this space where the “quality” function has been evaluated may allow us to see trends or 
unexplored combinations of parameters. 